Index

  1. Data
  2. Method (explained based on LM2)
  3. Results for other cases

Data

source('../../workflow/resources/annotateVariants.R')
sampleName <- 'Br26'
inputFolder <- '/cluster/work/bewi/members/jgawron/projects/CTC/input_folder'

annotations <- annotate_variants(sampleName, inputFolder)

Mutation distance matrix

For each cluster (defined by color), we computed a pairwise distance for each mutation pair that indicates how often the two mutations occur in the same private branch of cells from the cluster:

dist(M1, M2) = 0 (for M1 = M2)
dist(M1,M2) = 1 - (%samples where M1 and M2 are both in the same private branch of a cell from the cluster) (elsewise)

A private branch is defined as the path from a leaf to the node just below the LCA of this leaf to another leaf from the same cluster.

This is a generalization of the earlier method to find the top seperating mutations of pairs of leafs. The generalization was necessary to handle the larger clusters that were broken in more than 2 pieces.

plum

clusterName <- "lightcoral"

d <- read.table(file.path(inputFolder, sampleName, paste0(sampleName, "_postSampling_", clusterName, ".txt")), header = TRUE, sep = "\t", stringsAsFactors = F, row.names = 1)
mat <- as.matrix(d)
mat[1:4, 1:4]
##                chr11_78017064 chrX_141003335 chr15_33864037 chrX_77691570
## chr11_78017064       0.000000       0.998925       0.999075      0.999250
## chrX_141003335       0.998925       0.000000       0.852025      0.863275
## chr15_33864037       0.999075       0.852025       0.000000      0.844025
## chrX_77691570        0.999250       0.863275       0.844025      0.000000

Position-wise coverage score

For each position, we computed the percentage of samples that have a coverage of at least 3 at this position. This is meant as a simple score of the data quality of a position that can be used in addition to the separation score to pick mutations for the wet lab experiments. Furthermore, we added simple functional annotations to the variants.

coverage <- read.table(file.path(inputFolder, sampleName, paste(sampleName, "covScore.txt", sep = "_")), header = TRUE, sep = "\t", stringsAsFactors = F, row.names = 1)
coverage$variantName <- rownames(coverage)
head(coverage)
##                 covScore    variantName
## chr11_78017064 0.8333333 chr11_78017064
## chrX_141003335 0.6666667 chrX_141003335
## chr15_33864037 0.5000000 chr15_33864037
## chrX_77691570  0.6666667  chrX_77691570
## chr18_51062723 0.5000000 chr18_51062723
## chrX_141003560 0.6666667 chrX_141003560
coverage <- inner_join(coverage, annotations, by = "variantName")

Method

Mutation clustering

  1. Overview: Raw plot of the distance matrix.
  2. Filter distant mutations: Remove all mutations that are not close to any other mutations (minDist>0.5)
  3. Dendrogram: Use the distance matrix to cluster the mutations using hierarchical clustering.
  4. Cluster remaining mutations: Re-do the hierarchical clustering witht the remaining mutations
  5. Define cut point to get about as many groups as there are cluster pieces
  6. Rank top separating mutations: Within each group, reduce distance matrix to mutations in the group, rank them by their average distance to other mutations in the group.

###Overview To get an overview, we plot the full distance matrix:

library(heatmaply)
## Loading required package: plotly
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
## Loading required package: viridis
## Loading required package: viridisLite
## 
## ======================
## Welcome to heatmaply version 1.5.0
## 
## Type citation('heatmaply') for how to cite the package.
## Type ?heatmaply for the main documentation.
## 
## The github page is: https://github.com/talgalili/heatmaply/
## Please submit your suggestions and bug-reports at: https://github.com/talgalili/heatmaply/issues
## You may ask questions at stackoverflow, use the r and heatmaply tags: 
##   https://stackoverflow.com/questions/tagged/heatmaply
## ======================
heatmaply(mat)

Filter out distant mutations

mat2 <- mat
diag(mat2) <- 1
min_dist <- apply(mat2, 1, min) # find minimum distance to other mutations
selected_muts <- which(min_dist < 1) # select those below 0.5 say
mat3 <- mat[selected_muts, selected_muts]

This is what the distance matrix looks like now:

heatmaply(mat3)

Dendrogram of the remaining mutations

To cluster mutations, we create a dendrogram based on the pairwise distances:

mat <- mat3
d_mat <- as.dist(mat)
hc <- hclust(d_mat, "average") ## hierarchical clustering of mutations based on distance matrix
par(cex = 0.6)
plot(hc, main = "Dendrogram based on average pairwise distance", sub = "", xlab = "Separating mutations")

The case seems to be a bit clearer in this case: There is one single mutation that never cooccurs with any of the other mutations.

coverage %>% filter(variantName %in% colnames(mat2))
##     covScore    variantName REF ALT relevant
## 1  0.8333333 chr11_78017064   G   A     NONE
## 2  0.6666667 chrX_141003335   T   C     NONE
## 3  0.5000000 chr15_33864037   G   A     NONE
## 4  0.6666667  chrX_77691570   A   T     NONE
## 5  0.5000000 chr18_51062723   T   C     NONE
## 6  0.6666667 chrX_141003560   C   G MODERATE
## 7  0.6666667 chr19_54595789   C   G MODERATE
## 8  0.5000000 chr19_54576073   T   A MODERATE
## 9  0.5000000  chr9_33795599   T   C MODERATE
## 10 0.5000000 chr17_12111084   G   T     NONE
## 11 0.5000000 chr7_142760456   C   G     HIGH

However, the coverage score of chr11_78017064 sticks out. Could it be that this variant is found to be more distant than the others because we can determine its genotype with more confidence?

Identifying the clusters

We define a cut point to get distinct branches. These should roughly represent the mutations on the private branches in the posterior sampling. The number of clusters can either be the number of distinct branches in the posterior sampling or decided based on the hierarchical clustering.

par(cex = 0.6)
plot(hc, main = "Dendrogram based on average pairwise distance", sub = "", xlab = "Separating mutations")
abline(h = 0.98, lwd = 2, lty = 2, col = "green")

# geneGroups <- cutree(hc, k = NULL, h = 0.6)
geneGroups <- cutree(hc, k = 2)

Here we decide to cut the data into three clusters eventhough the MAP trees suggest that only two of the four cells from the CTC-cluster form distinct branches. In the MAP trees found for this case, the left-most branch corresponds to the mutations in the smaller private branch. The other two branches correspond to the larger private branch with the middle branch roughly consisting of the mutations lower down in the private branch. The right-most branch consists mostly of the mutations higher up in the branch, but also some mutations in the joint trunk.

Rank mutations within cluster

To rank the mutations within one cluster, we reduce the distance matrix to the cluster mutations and rank them by their average distance to the other mutations in the cluster.

First cluster:

cluster1 <- names(geneGroups)[geneGroups == 1]

Mutations in cluster:

## [1] "chr11_78017064"

This cluster only consists of one variant, so there is no need to rank the variants.

Technically we should exclude the value on the main diagonal from the average. But it is zero at all positions which means when only ordering the mutations based on the average, it makes no difference.

Top separating mutations (first branch):

top_df <- as.data.frame(0)
colnames(top_df) <- "avgDist"
top_df$variantName <- cluster1
top_muts_1 <- inner_join(top_df, coverage, by = "variantName")
# colnames(top_muts_1)[1] <- "mutation"
top_muts_1
##   avgDist    variantName  covScore REF ALT relevant
## 1       0 chr11_78017064 0.8333333   G   A     NONE
print(sprintf("Number of mutations in cluster %s with moderate functional impact: %d", clusterName, sum(top_muts_1 == "MODERATE")))
## [1] "Number of mutations in cluster lightcoral with moderate functional impact: 0"
print(sprintf("Number of mutations in cluster %s with high functional impact: %d", clusterName, sum(top_muts_1 == "HIGH")))
## [1] "Number of mutations in cluster lightcoral with high functional impact: 0"